{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using RAGatouille with llama-hub loaders\n",
    "\n",
    "Use any loader from llama-hub with ragatouille!\n",
    "\n",
    "Basically we should be able to load anything from https://llamahub.ai/?tab=loaders which opens up a lot of avenues for RAGatouille!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install llama-hub\n",
    "# !pip install arxiv\n",
    "# !pip install semanticscholar"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    }
   ],
   "source": [
    "from ragatouille import RAGPretrainedModel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PubMed Loader\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 14, 20:59:06] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783939&db=pmc\n",
      "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783816&db=pmc\n",
      "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783752&db=pmc\n",
      "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783720&db=pmc\n",
      "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783708&db=pmc\n",
      "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783652&db=pmc\n",
      "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783611&db=pmc\n",
      "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783392&db=pmc\n",
      "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783308&db=pmc\n",
      "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783278&db=pmc\n",
      "\n",
      "\n",
      "[Jan 14, 20:59:22] #> Note: Output directory .ragatouille/colbert/indexes/vaccine_papers already exists\n",
      "\n",
      "\n",
      "[Jan 14, 20:59:22] #> Will delete 10 files already at .ragatouille/colbert/indexes/vaccine_papers in 20 seconds...\n",
      "#> Starting...\n",
      "[Jan 14, 20:59:44] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.\n",
      "  warnings.warn(\n",
      "  0%|          | 0/10 [00:00<?, ?it/s]/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 14, 20:59:44] [0] \t\t #> Encoding 605 passages..\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " 10%|█         | 1/10 [00:03<00:34,  3.85s/it]/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n",
      "100%|██████████| 10/10 [00:33<00:00,  3.36s/it]\n",
      "WARNING clustering 76995 points to 4096 centroids: please provide at least 159744 training points\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 14, 21:00:18] [0] \t\t avg_doclen_est = 133.9619903564453 \t len(local_sample) = 605\n",
      "[Jan 14, 21:00:18] [0] \t\t Creating 4,096 partitions.\n",
      "[Jan 14, 21:00:18] [0] \t\t *Estimated* 81,047 embeddings.\n",
      "[Jan 14, 21:00:18] [0] \t\t #> Saving the indexing plan to .ragatouille/colbert/indexes/vaccine_papers/plan.json ..\n",
      "Clustering 76995 points in 128D to 4096 clusters, redo 1 times, 20 iterations\n",
      "  Preprocessing in 0.00 s\n",
      "  Iteration 19 (4.44 s, search 4.38 s): objective=16826.5 imbalance=1.402 nsplit=0       \n",
      "[0.035, 0.038, 0.037, 0.033, 0.033, 0.036, 0.035, 0.033, 0.034, 0.035, 0.034, 0.035, 0.035, 0.035, 0.034, 0.034, 0.032, 0.035, 0.033, 0.035, 0.035, 0.036, 0.036, 0.035, 0.033, 0.034, 0.036, 0.036, 0.035, 0.04, 0.034, 0.037, 0.035, 0.035, 0.034, 0.031, 0.036, 0.035, 0.036, 0.04, 0.036, 0.036, 0.035, 0.034, 0.033, 0.033, 0.035, 0.037, 0.035, 0.034, 0.033, 0.034, 0.035, 0.034, 0.034, 0.037, 0.039, 0.036, 0.039, 0.034, 0.033, 0.034, 0.036, 0.036, 0.037, 0.034, 0.035, 0.037, 0.033, 0.035, 0.035, 0.033, 0.035, 0.034, 0.035, 0.036, 0.034, 0.035, 0.038, 0.038, 0.035, 0.035, 0.033, 0.036, 0.034, 0.034, 0.036, 0.038, 0.033, 0.04, 0.034, 0.037, 0.034, 0.035, 0.035, 0.035, 0.04, 0.035, 0.036, 0.034, 0.036, 0.038, 0.034, 0.033, 0.036, 0.034, 0.034, 0.032, 0.035, 0.033, 0.036, 0.037, 0.037, 0.032, 0.035, 0.033, 0.034, 0.035, 0.035, 0.034, 0.033, 0.033, 0.035, 0.036, 0.033, 0.036, 0.035, 0.032]\n",
      "[Jan 14, 21:00:22] [0] \t\t #> Encoding 605 passages..\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0it [00:00, ?it/s]\n",
      "  0%|          | 0/10 [00:00<?, ?it/s]\u001b[A\n",
      " 10%|█         | 1/10 [00:03<00:32,  3.61s/it]\u001b[A\n",
      " 20%|██        | 2/10 [00:07<00:29,  3.68s/it]\u001b[A\n",
      " 30%|███       | 3/10 [00:11<00:26,  3.74s/it]\u001b[A\n",
      " 40%|████      | 4/10 [00:15<00:22,  3.79s/it]\u001b[A\n",
      " 50%|█████     | 5/10 [00:18<00:18,  3.79s/it]\u001b[A\n",
      " 60%|██████    | 6/10 [00:22<00:15,  3.80s/it]\u001b[A\n",
      " 70%|███████   | 7/10 [00:26<00:11,  3.79s/it]\u001b[A\n",
      " 80%|████████  | 8/10 [00:30<00:07,  3.79s/it]\u001b[A\n",
      " 90%|█████████ | 9/10 [00:33<00:03,  3.79s/it]\u001b[A\n",
      "100%|██████████| 10/10 [00:35<00:00,  3.56s/it]\u001b[A\n",
      "1it [00:36, 36.30s/it]\n",
      "100%|██████████| 1/1 [00:00<00:00, 3509.88it/s]\n",
      "100%|██████████| 4096/4096 [00:00<00:00, 307811.25it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 14, 21:00:59] #> Optimizing IVF to store map from centroids to list of pids..\n",
      "[Jan 14, 21:00:59] #> Building the emb2pid mapping..\n",
      "[Jan 14, 21:00:59] len(emb2pid) = 81047\n",
      "[Jan 14, 21:00:59] #> Saved optimized IVF to .ragatouille/colbert/indexes/vaccine_papers/ivf.pid.pt\n",
      "#> Joined...\n",
      "Done indexing!\n"
     ]
    }
   ],
   "source": [
    "from llama_index import download_loader\n",
    "\n",
    "RAG = RAGPretrainedModel.from_pretrained(\"colbert-ir/colbertv2.0\")\n",
    "PubmedReader = download_loader(\"PubmedReader\")\n",
    "\n",
    "loader = PubmedReader()\n",
    "documents = loader.load_data(search_query=\"covid 19 vaccine\")\n",
    "\n",
    "list_vaccine_papers = [document.text for document in documents]\n",
    "\n",
    "RAG.index(\n",
    "    collection=list_vaccine_papers,\n",
    "    index_name=\"vaccine_papers\",\n",
    "    max_document_length=180,\n",
    "    split_documents=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'content': 'A review of the challenges assessing the clinical efficacy of vaccines against SARS-CoV-2 Lancet Infect Dis. 2021 Feb 1 21 2 e26 35 https://pubmed.ncbi.nlm.nih.gov/33125914/ doi: 10.1016/S1473-3099(20)30773-8 33125914  8 World Health Organization. Coronavirus disease (COVID-19): Herd immunity, lockdowns and COVID-19 [Internet]. [cited 2022 Jun 22].',\n",
       "  'score': 23.274442672729492,\n",
       "  'rank': 1},\n",
       " {'content': 'Safety and efficacy of COVID-19 vaccines: a systematic review and meta-analysis of different vaccines at phase 3 Vaccines (Basel) 2021 9 989 34579226  26 Ismail II Salama S A systematic review of cases of CNS demyelination following COVID-19 vaccination J Neuroimmunol 2022 362 577765 34839149  27 Hsiao Y-T Tsai M-J Chen Y-H Acute transverse myelitis after COVID-19 vaccination Medicina (Kaunas) 2021 57 1010 34684047  28 Ismail II Salama S Association of CNS demyelination and COVID-19 infection: an updated systematic review J Neurol 2022 269 541 576 34386902  29 Román GC Gracia F Torres A',\n",
       "  'score': 23.064647674560547,\n",
       "  'rank': 2},\n",
       " {'content': 'These guidelines emphasize the importance of influenza vaccination as a preventive measure to protect individuals with HF from the potential complications and adverse outcomes associated with influenza infection 1 27 40 41 The effectiveness of influenza vaccines in lowering cardiovascular events among people with coronary heart disease has been supported by numerous studies 42 44 44 45 A meta-analysis conducted by Gupta et al 46 P P P 46 A recent meta-analysis conducted by Rodrigues et al 47 I 2 I 2 47 In 2022, a multinational randomized controlled trial conducted by Loeb et al 48 48 Influenza vaccination has the potential to decrease adverse cardiovascular events, although the certainty of available evidence is currently low.',\n",
       "  'score': 22.028385162353516,\n",
       "  'rank': 3}]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "k = 3\n",
    "# search in 50 documents for relevant papers\n",
    "results = RAG.search(query=\"efficacy of vaccine\", k=k, index_name=\"vaccine_papers\")\n",
    "results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Semantic Scholar Loader\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 14, 20:44:05] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "[Jan 14, 20:44:07] #> Creating directory .ragatouille/colbert/indexes/biases_llms \n",
      "\n",
      "\n",
      "#> Starting...\n",
      "[Jan 14, 20:44:09] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.\n",
      "  warnings.warn(\n",
      "  0%|          | 0/2 [00:00<?, ?it/s]/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 14, 20:44:09] [0] \t\t #> Encoding 92 passages..\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " 50%|█████     | 1/2 [00:03<00:03,  3.56s/it]/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n",
      "100%|██████████| 2/2 [00:05<00:00,  2.54s/it]\n",
      "WARNING clustering 10906 points to 1024 centroids: please provide at least 39936 training points\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 14, 20:44:14] [0] \t\t avg_doclen_est = 124.78260803222656 \t len(local_sample) = 92\n",
      "[Jan 14, 20:44:14] [0] \t\t Creating 1,024 partitions.\n",
      "[Jan 14, 20:44:14] [0] \t\t *Estimated* 11,479 embeddings.\n",
      "[Jan 14, 20:44:14] [0] \t\t #> Saving the indexing plan to .ragatouille/colbert/indexes/biases_llms/plan.json ..\n",
      "Clustering 10906 points in 128D to 1024 clusters, redo 1 times, 20 iterations\n",
      "  Preprocessing in 0.00 s\n",
      "  Iteration 19 (0.20 s, search 0.20 s): objective=2226.81 imbalance=1.447 nsplit=0       \n",
      "[0.037, 0.037, 0.035, 0.032, 0.032, 0.035, 0.039, 0.033, 0.033, 0.035, 0.033, 0.033, 0.036, 0.04, 0.033, 0.035, 0.032, 0.034, 0.031, 0.035, 0.033, 0.036, 0.031, 0.037, 0.035, 0.034, 0.038, 0.033, 0.034, 0.035, 0.034, 0.039, 0.035, 0.033, 0.032, 0.031, 0.035, 0.034, 0.032, 0.036, 0.035, 0.036, 0.032, 0.034, 0.034, 0.032, 0.037, 0.036, 0.037, 0.035, 0.031, 0.037, 0.041, 0.036, 0.035, 0.033, 0.037, 0.035, 0.037, 0.034, 0.034, 0.036, 0.039, 0.036, 0.034, 0.035, 0.031, 0.034, 0.031, 0.034, 0.036, 0.031, 0.034, 0.036, 0.035, 0.034, 0.039, 0.036, 0.034, 0.036, 0.035, 0.034, 0.034, 0.035, 0.034, 0.033, 0.033, 0.034, 0.032, 0.036, 0.034, 0.039, 0.033, 0.039, 0.036, 0.034, 0.037, 0.032, 0.034, 0.039, 0.037, 0.039, 0.034, 0.034, 0.034, 0.035, 0.035, 0.03, 0.036, 0.033, 0.036, 0.035, 0.036, 0.032, 0.038, 0.034, 0.036, 0.034, 0.032, 0.034, 0.034, 0.034, 0.033, 0.039, 0.032, 0.036, 0.036, 0.035]\n",
      "[Jan 14, 20:44:15] [0] \t\t #> Encoding 92 passages..\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0it [00:00, ?it/s]\n",
      "  0%|          | 0/2 [00:00<?, ?it/s]\u001b[A\n",
      " 50%|█████     | 1/2 [00:03<00:03,  3.45s/it]\u001b[A\n",
      "100%|██████████| 2/2 [00:04<00:00,  2.47s/it]\u001b[A\n",
      "1it [00:04,  4.98s/it]\n",
      "100%|██████████| 1/1 [00:00<00:00, 5974.79it/s]\n",
      "100%|██████████| 1024/1024 [00:00<00:00, 364567.29it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 14, 20:44:19] #> Optimizing IVF to store map from centroids to list of pids..\n",
      "[Jan 14, 20:44:19] #> Building the emb2pid mapping..\n",
      "[Jan 14, 20:44:19] len(emb2pid) = 11480\n",
      "[Jan 14, 20:44:19] #> Saved optimized IVF to .ragatouille/colbert/indexes/biases_llms/ivf.pid.pt\n",
      "#> Joined...\n",
      "Done indexing!\n"
     ]
    }
   ],
   "source": [
    "from llama_hub.semanticscholar import SemanticScholarReader\n",
    "\n",
    "RAG = RAGPretrainedModel.from_pretrained(\"colbert-ir/colbertv2.0\")\n",
    "query_space = \"biases in language models\"\n",
    "\n",
    "s2reader = SemanticScholarReader()\n",
    "documents = s2reader.load_data(\n",
    "    query_space, limit=50, full_text=False\n",
    ")  # install pypdf2 to get full text\n",
    "\n",
    "# iterate over documents and get \"text\" field from each document and store in list_documents\n",
    "list_documents = [document.text for document in documents]\n",
    "\n",
    "RAG.index(\n",
    "    collection=list_documents,\n",
    "    index_name=\"biases_llms\",\n",
    "    max_document_length=180,\n",
    "    split_documents=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "New index_name received! Updating current index_name (biases_llms) to biases_llms\n",
      "Loading searcher for index biases_llms for the first time... This may take a few seconds\n",
      "[Jan 14, 20:44:31] #> Loading codec...\n",
      "[Jan 14, 20:44:31] #> Loading IVF...\n",
      "[Jan 14, 20:44:31] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n",
      "[Jan 14, 20:44:31] #> Loading doclens...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1/1 [00:00<00:00, 3802.63it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 14, 20:44:31] #> Loading codes and residuals...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "100%|██████████| 1/1 [00:00<00:00, 660.52it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 14, 20:44:31] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 14, 20:44:32] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n",
      "Searcher loaded!\n",
      "\n",
      "#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==\n",
      "#> Input: . demographic biases, \t\t True, \t\t None\n",
      "#> Output IDs: torch.Size([32]), tensor([  101,     1, 15982, 13827,  2229,   102,   103,   103,   103,   103,\n",
      "          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,\n",
      "          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,\n",
      "          103,   103])\n",
      "#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
      "        0, 0, 0, 0, 0, 0, 0, 0])\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'content': 'The Unequal Opportunities of Large Language Models: Examining Demographic Biases in Job Recommendations by ChatGPT and LLaMA Warning: This paper discusses and contains content that is offensive or upsetting. Large Language Models (LLMs) have seen widespread deployment in various real-world applications. Understanding these biases is crucial to comprehend the potential downstream consequences when using LLMs to make decisions, particularly for historically disadvantaged groups. In this work, we propose a simple method for analyzing and comparing demographic bias in LLMs, through the lens of job recommendations. We demonstrate the effectiveness of our method by measuring intersectional biases within ChatGPT and LLaMA, two cutting-edge LLMs. Our experiments primarily focus on uncovering gender identity and nationality bias; however, our method can be extended to examine biases associated with any intersection of demographic identities.',\n",
       "  'score': 24.941387176513672,\n",
       "  'rank': 1},\n",
       " {'content': 'Occupational Biases in Norwegian and Multilingual Language Models In this paper we explore how a demographic distribution of occupations, along gender dimensions, is reflected in pre-trained language models. We give a descriptive assessment of the distribution of occupations, and investigate to what extent these are reflected in four Norwegian and two multilingual models. To this end, we introduce a set of simple bias probes, and perform five different tasks combining gendered pronouns, first names, and a set of occupations from the Norwegian statistics bureau. We show that language specific models obtain more accurate results, and are much closer to the real-world distribution of clearly gendered occupations. However, we see that none of the models have correct representations of the occupations that are demographically balanced between genders.',\n",
       "  'score': 24.29305076599121,\n",
       "  'rank': 2},\n",
       " {'content': 'Our experiments primarily focus on uncovering gender identity and nationality bias; however, our method can be extended to examine biases associated with any intersection of demographic identities. We identify distinct biases in both models toward various demographic identities, such as both models consistently suggesting low-paying jobs for Mexican workers or preferring to recommend secretarial roles to women. Our study highlights the importance of measuring the bias of LLMs in downstream applications to understand the potential for harm and inequitable outcomes. Our code is available at https://github.com/Abel2Code/Unequal-Opportunities-of-LLMs.',\n",
       "  'score': 24.261198043823242,\n",
       "  'rank': 3}]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "k = 3\n",
    "# search in 50 documents for relevant papers\n",
    "results = RAG.search(query=\"demographic biases\", k=k, index_name=\"biases_llms\")\n",
    "results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PDF Loader\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "from pathlib import Path\n",
    "from llama_index import download_loader\n",
    "\n",
    "RAG = RAGPretrainedModel.from_pretrained(\"colbert-ir/colbertv2.0\")\n",
    "\n",
    "PDFReader = download_loader(\"PDFReader\")\n",
    "\n",
    "loader = PDFReader()\n",
    "documents = loader.load_data(file=Path(\"data/llama2.pdf\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "[Jan 14, 20:44:39] #> Creating directory .ragatouille/colbert/indexes/llama2 \n",
      "\n",
      "\n",
      "#> Starting...\n",
      "[Jan 14, 20:44:42] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.\n",
      "  warnings.warn(\n",
      "  0%|          | 0/7 [00:00<?, ?it/s]/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 14, 20:44:42] [0] \t\t #> Encoding 413 passages..\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " 14%|█▍        | 1/7 [00:05<00:34,  5.72s/it]/Users/shaurya/github/RAGatouille/.conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n",
      "100%|██████████| 7/7 [00:35<00:00,  5.05s/it]\n",
      "WARNING clustering 68914 points to 4096 centroids: please provide at least 159744 training points\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 14, 20:45:18] [0] \t\t avg_doclen_est = 175.64407348632812 \t len(local_sample) = 413\n",
      "[Jan 14, 20:45:18] [0] \t\t Creating 4,096 partitions.\n",
      "[Jan 14, 20:45:18] [0] \t\t *Estimated* 72,541 embeddings.\n",
      "[Jan 14, 20:45:18] [0] \t\t #> Saving the indexing plan to .ragatouille/colbert/indexes/llama2/plan.json ..\n",
      "Clustering 68914 points in 128D to 4096 clusters, redo 1 times, 20 iterations\n",
      "  Preprocessing in 0.00 s\n",
      "  Iteration 19 (4.18 s, search 4.13 s): objective=16924.9 imbalance=1.444 nsplit=0       \n",
      "[0.036, 0.04, 0.036, 0.036, 0.035, 0.04, 0.039, 0.036, 0.036, 0.038, 0.037, 0.037, 0.036, 0.037, 0.037, 0.039, 0.035, 0.039, 0.035, 0.037, 0.037, 0.039, 0.037, 0.039, 0.034, 0.037, 0.039, 0.036, 0.038, 0.042, 0.037, 0.04, 0.039, 0.036, 0.037, 0.035, 0.039, 0.037, 0.037, 0.041, 0.039, 0.038, 0.036, 0.037, 0.037, 0.036, 0.036, 0.041, 0.037, 0.035, 0.037, 0.037, 0.038, 0.037, 0.036, 0.038, 0.041, 0.039, 0.044, 0.035, 0.037, 0.038, 0.039, 0.039, 0.039, 0.038, 0.037, 0.037, 0.035, 0.038, 0.039, 0.036, 0.037, 0.037, 0.037, 0.037, 0.038, 0.041, 0.041, 0.041, 0.04, 0.038, 0.038, 0.039, 0.035, 0.037, 0.038, 0.038, 0.033, 0.04, 0.036, 0.039, 0.037, 0.039, 0.037, 0.038, 0.039, 0.036, 0.039, 0.037, 0.038, 0.039, 0.038, 0.038, 0.038, 0.036, 0.036, 0.035, 0.037, 0.035, 0.038, 0.039, 0.038, 0.035, 0.037, 0.036, 0.039, 0.038, 0.038, 0.036, 0.035, 0.036, 0.039, 0.039, 0.035, 0.04, 0.037, 0.035]\n",
      "[Jan 14, 20:45:22] [0] \t\t #> Encoding 413 passages..\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0it [00:00, ?it/s]\n",
      "  0%|          | 0/7 [00:00<?, ?it/s]\u001b[A\n",
      " 14%|█▍        | 1/7 [00:05<00:35,  5.97s/it]\u001b[A\n",
      " 29%|██▊       | 2/7 [00:11<00:29,  5.83s/it]\u001b[A\n",
      " 43%|████▎     | 3/7 [00:17<00:23,  5.82s/it]\u001b[A\n",
      " 57%|█████▋    | 4/7 [00:23<00:17,  5.84s/it]\u001b[A\n",
      " 71%|███████▏  | 5/7 [00:29<00:11,  5.83s/it]\u001b[A\n",
      " 86%|████████▌ | 6/7 [00:35<00:05,  5.85s/it]\u001b[A\n",
      "100%|██████████| 7/7 [00:37<00:00,  5.37s/it]\u001b[A\n",
      "1it [00:38, 38.35s/it]\n",
      "100%|██████████| 1/1 [00:00<00:00, 2763.05it/s]\n",
      "100%|██████████| 4096/4096 [00:00<00:00, 266979.58it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 14, 20:46:01] #> Optimizing IVF to store map from centroids to list of pids..\n",
      "[Jan 14, 20:46:01] #> Building the emb2pid mapping..\n",
      "[Jan 14, 20:46:01] len(emb2pid) = 72541\n",
      "[Jan 14, 20:46:01] #> Saved optimized IVF to .ragatouille/colbert/indexes/llama2/ivf.pid.pt\n",
      "#> Joined...\n",
      "Done indexing!\n"
     ]
    }
   ],
   "source": [
    "list_pdf_documents = [document.text for document in documents]\n",
    "\n",
    "\n",
    "RAG.index(\n",
    "    collection=list_pdf_documents,\n",
    "    index_name=\"llama2\",\n",
    "    max_document_length=256,\n",
    "    split_documents=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'content': 'They enable interaction with humans through intuitive\\nchat interfaces, which has led to rapid and widespread adoption among the general public.\\nThecapabilitiesofLLMsareremarkableconsideringtheseeminglystraightforwardnatureofthetraining\\nmethodology. Auto-regressivetransformersarepretrainedonanextensivecorpusofself-superviseddata,\\nfollowed by alignment with human preferences via techniques such as Reinforcement Learning with Human\\nFeedback(RLHF).Althoughthetrainingmethodologyissimple,highcomputationalrequirementshave\\nlimited the development of LLMs to a few players. There have been public releases of pretrained LLMs\\n(such as BLOOM (Scao et al., 2022), LLaMa-1 (Touvron et al., 2023), and Falcon (Penedo et al., 2023)) that\\nmatch the performance of closed pretrained competitors like GPT-3 (Brown et al., 2020) and Chinchilla\\n(Hoffmann et al., 2022), but none of these models are suitable substitutes for closed “product” LLMs, such\\nasChatGPT,BARD,andClaude.',\n",
       "  'score': 22.288442611694336,\n",
       "  'rank': 1},\n",
       " {'content': 'Consequently,themodel’sperformanceinlanguagesotherthan\\nEnglish remains fragile and should be used with caution.\\nLike other LLMs, Llama 2 may generate harmful, offensive, or biased content due to its training on publicly\\navailable online datasets. We attempted to mitigate this via fine-tuning, but some issues may remain,\\nparticularlyforlanguagesotherthanEnglish wherepubliclyavailable datasetswerenotavailable. Wewill\\ncontinue to fine-tune and release updated versions in the future as we progress on addressing these issues.\\n‡‡https://openai.com/blog/chatgpt-plugins\\n34',\n",
       "  'score': 22.035356521606445,\n",
       "  'rank': 2},\n",
       " {'content': 'Known LLM Safety Challenges. Recent literature has extensively explored the risks and challenges linked\\nwith Large Language Models. Bender et al. (2021b) and Weidinger et al. (2021) underscore various hazards\\nlikebias,toxicity,privatedataleakage,andthepotentialformalicioususes. Solaimanetal.(2023)categorizes\\ntheseimpactsintotwogroups—thosethatcanbeassessedwithinthebasesystemandthoserequiringa\\nsocietal context evaluation, while Kumar et al. (2022) offers potential mitigation strategies to curb harm.\\nWorkfromRolleretal.(2020)andDinanetal.(2021)alsoilluminatesthedifficultiestiedtochatbot-oriented\\nLLMs, with concerns ranging from privacy to misleading expertise claims. Deng et al. (2023) proposes\\na taxonomic framework to tackle these issues, and Bergman et al. (2022) delves into the balance between\\npotential positive and negative impacts from releasing dialogue models.',\n",
       "  'score': 20.13809585571289,\n",
       "  'rank': 3}]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "k = 3\n",
    "# search in 50 documents for relevant papers\n",
    "results = RAG.search(query=\"Public releases of other llms\", k=k, index_name=\"llama2\")\n",
    "results"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
